Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-101362: Optimise pathlib by deferring path normalisation #101560

Closed
wants to merge 16 commits into from

Conversation

barneygale
Copy link
Contributor

@barneygale barneygale commented Feb 4, 2023

PurePath now normalises and splits paths only when necessary, e.g. when .name or .parent is accessed. The result is cached. This speeds up path object construction by around 4x.

PurePath.__fspath__() now returns an unnormalised path, which should be transparent to filesystem APIs (else pathlib's normalisation is broken!). This extends the earlier performance improvement to most impure Path methods, and also speeds up p.joinpath('bar') and p / 'bar'. edit: will fix separately.

This also fixes GH-76846 and GH-85281 by unifying path constructors and adding an __init__() method. edit: will fix separately.

`PurePath` now normalises and splits paths only when necessary, e.g. when
`.name` or `.parent` is accessed. The result is cached. This speeds up path
object construction by around 4x.

`PurePath.__fspath__()` now returns an unnormalised path, which should be
transparent to filesystem APIs (else pathlib's normalisation is broken!).
This extends the earlier performance improvement to most impure `Path`
methods, and also speeds up pickling, `p.joinpath('bar')` and `p / 'bar'`.

This also fixes pythonGH-76846 and pythonGH-85281 by unifying path constructors and
adding an `__init__()` method.
@barneygale
Copy link
Contributor Author

barneygale commented Feb 4, 2023

Constructing path objects is up to 4x faster with one argument:

$ ./python -m timeit -n 1000000 -s 'from pathlib import PurePath' 'PurePath("foo/bar")' 
1000000 loops, best of 5: 2.01 usec per loop  # before
1000000 loops, best of 5: 495 nsec per loop  # after

More than 2x faster with two arguments:

$ ./python -m timeit -n 1000000 -s 'from pathlib import PurePath' 'PurePath("foo", "bar")' 
1000000 loops, best of 5: 2.28 usec per loop  # before
1000000 loops, best of 5: 1.02 usec per loop  # after

~~And ~25% faster when joining arguments:~~

[edit: no longer true! ]

$ ./python -m timeit -n 1000000 -s 'from pathlib import PurePath; p = PurePath("foo")' 'p.joinpath("bar")' 
1000000 loops, best of 5: 1.66 usec per loop  # before
1000000 loops, best of 5: 1.3 usec per loop  # after

But it's 12% slower when the path needs normalization, as with str()

$ ./python -m timeit -n 1000000 -s 'from pathlib import PurePath' 'str(PurePath("foo/bar"))' 
1000000 loops, best of 5: 2.96 usec per loop  # before
1000000 loops, best of 5: 3.31 usec per loop  # after

And 25% slower when when walking directories (where pathlib keeps everything normalized):

[edit: resolved! see comment]

$ ./python -m timeit -n 20 -s 'from pathlib import Path' 'list(Path().rglob("*"))' 
20 loops, best of 5: 53.4 msec per loop  # before
20 loops, best of 5: 66.5 msec per loop  # after

But still faster for filesystem operations that don't require normalization:

[edit: no longer true! this can't be properly fixed until other stuff lands]

$ ./python -m timeit -n 100000 -s 'from pathlib import Path' 'Path("README.rst").read_text()' 
100000 loops, best of 5: 26.1 usec per loop  # before
100000 loops, best of 5: 21.2 usec per loop  # after

$ ./python -m timeit -n 100000 -s 'from pathlib import Path' 'Path("README.rst").exists()' 
100000 loops, best of 5: 5.45 usec per loop  # before
100000 loops, best of 5: 2.97 usec per loop  # after

@barneygale barneygale marked this pull request as ready for review February 4, 2023 18:28
@barneygale barneygale marked this pull request as draft February 7, 2023 20:35
@barneygale
Copy link
Contributor Author

I've found a couple other small optimizations which are best tackled in other PRs, so I'm marking this PR as a 'draft' for now.

@barneygale barneygale changed the title GH-101362 - Optimise pathlib by deferring path normalisation GH-101362: Optimise pathlib by deferring path normalisation Mar 6, 2023
@AlexWaygood AlexWaygood added the performance Performance or resource usage label Mar 6, 2023
@barneygale
Copy link
Contributor Author

I've undone the change to _from_parsed_parts(), which has restored directory-walking performance:

$ ./python -m timeit -n 20 -s 'from pathlib import Path' 'list(Path().rglob("*"))' 
20 loops, best of 5: 146 msec per loop  # before
20 loops, best of 5: 152 msec per loop  # after

Still a tiny bit slower than pre-PR.

The rest of the speedups/slowdowns mentioned in my previous comment are still there.

@barneygale barneygale marked this pull request as ready for review March 6, 2023 02:33
@barneygale
Copy link
Contributor Author

The change to importlib is necessary because it's relying on a bug in pathlib's path normalization:

I think I need to solve that issue first, so I'm going to mark this PR as a draft (again!)

@barneygale barneygale marked this pull request as draft March 11, 2023 23:32
@barneygale barneygale marked this pull request as ready for review March 17, 2023 16:20
@barneygale barneygale marked this pull request as draft March 17, 2023 16:45
@barneygale
Copy link
Contributor Author

This PR has strayed too far from the original implementation. I'm going to abandon it. New PR here:

@barneygale barneygale closed this Mar 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

pathlib.Path._from_parsed_parts should call cls.__new__(cls)
3 participants